Suicidal Tendencies and Ideation prediction using Reddit


3. Modelling

In this section, we will use a Pipeline to score different classifier models, such as K-Nearest Neighbours and Multinomial Naive Bayes, before settling on a final production model.

3.1 Establishing a baseline score

We will first calculate the baseline score for our models to "out-perform". In the context of our project, the baseline score is the accuracy we would achieve if we simply predicted that every one of our reddit posts is from the r/SuicideWatch subreddit.
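A minimal sketch of that calculation, assuming a hypothetical class split (the real counts come from our scraped dataset):

```python
# Baseline accuracy: always predict the majority class.
# The 550/450 split below is an assumption for illustration only.
from collections import Counter

labels = (["SuicideWatch"] * 550) + (["depression"] * 450)

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

baseline_score = majority_count / len(labels)
print(majority_class, round(baseline_score, 2))  # SuicideWatch 0.55
```

Any model we keep should beat this number; otherwise it is no better than guessing the majority class every time.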

3.2 Selecting the best column to pick our features from

Before moving forward to creating a production model, we will run a Count Vectorizer + Naive Bayes model on different columns and score them. This will help us pick which column to build further models on.
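A sketch of that per-column scoring loop, using scikit-learn; the column names and toy rows below are assumptions standing in for our real dataframe:

```python
# Score a CountVectorizer + Multinomial Naive Bayes pipeline on each
# candidate text column. Column names and data are illustrative assumptions.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    "title_clean": ["i cant go on", "feeling empty",
                    "need help now", "so tired of this"],
    "selftext_clean": ["thinking of ending it", "nothing matters anymore",
                       "please someone talk to me", "every day is grey"],
    "is_suicide": [1, 0, 1, 0],  # 1 = r/SuicideWatch, 0 = r/depression
})

for col in ["title_clean", "selftext_clean"]:
    pipe = Pipeline([("cvec", CountVectorizer()), ("nb", MultinomialNB())])
    scores = cross_val_score(pipe, df[col], df["is_suicide"], cv=2)
    print(col, scores.mean())
```

The column whose pipeline scores best becomes the one we draw features from in later models.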

Note: Understanding our confusion matrix

In the context of our project, these are what the parameters in our confusion matrix represent:

True Positives (TP) - We predict that an entry is from the r/SuicideWatch subreddit and we get it right. As we are seeking to identify suicide cases, our priority is to get as many of these!

True Negatives (TN) - We predict that an entry is from the r/depression subreddit and we get it right. This also means that we did well.

False Positives (FP) - We predict that an entry is from the r/SuicideWatch subreddit and we get it wrong. Needless to say, this is undesirable.

False Negatives (FN) - We predict that an entry is from the r/depression subreddit, but the entry is actually from r/SuicideWatch. This is the worst outcome: it means we might be missing out on helping someone who might be thinking about ending their life.
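The four quantities above can be unpacked directly from scikit-learn's confusion matrix; the toy labels below are an assumption for illustration:

```python
# Unpack TP/TN/FP/FN from sklearn's confusion matrix.
# With labels {0, 1}, .ravel() returns them in the order tn, fp, fn, tp.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0]  # 1 = r/SuicideWatch, 0 = r/depression
y_pred = [1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)  # 2 2 1 1
```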

Final choice made: megatext_clean as our "Production Column"

Based on a combination of scores from our modelling exercise above, we will proceed with megatext_clean -- a combination of our cleaned titles, usernames and posts -- as the column we will use to draw features from. Some reasons why:

Generalising Well - The model using megatext_clean scored 0.67 on the test set (the joint highest) while scoring 0.95 on the training set.

High ROC Area Under Curve score - As our classes are largely balanced, it is suitable to use AUC Scores as a metric to measure the quality of our model's predictions. Our top choice performs best there.

Best recall/sensitivity score - This score measures the ratio of entries our model correctly labels positive (i.e. as being in r/SuicideWatch) to all entries that are truly in r/SuicideWatch. As identifying those cases is the target of our project, strong performance on this metric is important (and perhaps most important) to us.
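Recall is TP / (TP + FN); a quick sketch with scikit-learn on assumed toy labels:

```python
# Recall = TP / (TP + FN): the share of true r/SuicideWatch posts we caught.
from sklearn.metrics import recall_score

y_true = [1, 1, 1, 0, 0]  # 1 = r/SuicideWatch
y_pred = [1, 1, 0, 0, 1]

# Two of the three true r/SuicideWatch posts were caught -> recall = 2/3.
print(round(recall_score(y_true, y_pred), 2))  # 0.67
```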


3.3 The search for a production model

Inspired by our earlier function, we will create a similar function that runs multiple permutations of models with Count, Hashing, and TF-IDF Vectorizers. The resulting metrics will be held neatly in a dataframe.
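A sketch of that search loop; the model names, toy data, and choice of classifiers are assumptions standing in for the real function:

```python
# Try each vectorizer/classifier pairing and collect scores in a dataframe.
# Toy data and the exact model list are illustrative assumptions.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import (CountVectorizer,
                                             HashingVectorizer,
                                             TfidfVectorizer)
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

X = ["i cant go on", "feeling empty", "need help now", "so tired of this"]
y = [1, 0, 1, 0]

results = []
# alternate_sign=False keeps hashed features non-negative for MultinomialNB.
vectorizers = [("cvec", CountVectorizer()),
               ("hvec", HashingVectorizer(alternate_sign=False)),
               ("tvec", TfidfVectorizer())]
classifiers = [("nb", MultinomialNB()),
               ("knn", KNeighborsClassifier(n_neighbors=1))]

for vec_name, vec in vectorizers:
    for clf_name, clf in classifiers:
        pipe = Pipeline([(vec_name, vec), (clf_name, clf)])
        pipe.fit(X, y)
        results.append({"model": f"{vec_name}+{clf_name}",
                        "train_acc": pipe.score(X, y)})

results_df = pd.DataFrame(results)
print(results_df)
```

Collecting everything into one dataframe makes it easy to sort and compare the pairings on each metric at a glance.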

Narrowing down to two models

The Hashing Vectorizer + Multinomial Naive Bayes model out-performed the other models on multiple metrics, especially our much-prized AUC score (0.77) and the recall score (which measures our model's ability to find True Positives). Another notable performer is the TF-IDF Vectorizer + Multinomial Naive Bayes combination: apart from the joint-second-highest AUC score of 0.73, its consistent performance on both the training and test sets showed that the model generalises well.

Next Step: Tuning Hyperparameters - We'll now move on to tweak the hyperparameters of both models.
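A sketch of that tuning step using GridSearchCV; the parameter grid and toy data below are illustrative assumptions, not the grid actually used:

```python
# Tune a TF-IDF + Multinomial Naive Bayes pipeline with GridSearchCV,
# scoring on ROC AUC. Grid values and data are illustrative assumptions.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

X = ["i cant go on", "feeling empty", "need help now",
     "so tired of this", "please talk to me", "nothing matters"]
y = [1, 0, 1, 0, 1, 0]

pipe = Pipeline([("tvec", TfidfVectorizer()), ("nb", MultinomialNB())])
params = {
    "tvec__max_features": [50, 70],
    "tvec__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 1.0],
}
gs = GridSearchCV(pipe, params, cv=3, scoring="roc_auc")
gs.fit(X, y)
print(gs.best_params_)
```

The same grid-search pattern applies to the Hashing Vectorizer pipeline, with its own parameter names.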

Production Model Chosen: TF-IDF Vectorizer + Multinomial Naive Bayes

The model responded well to the tuning sessions. Although the Hashing model had a slightly better AUC score, I'd prioritise this model's high recall score, as it will help identify potential suicide cases (True Positives) more reliably. This model also generalises well, with only a 0.01 gap between its training and test set scores.

3.4 Running the optimised production model

Our production model combines two components: a TF-IDF Vectorizer and a Multinomial Naive Bayes classifier.

The first, a TF-IDF (or "Term Frequency - Inverse Document Frequency") Vectorizer, assigns scores to the words (or in our case, the top 70 words) in our selected feature. TF-IDF penalises a word that appears across too many documents, so very common words receive lower weights.

The resulting matrix of "word scores" is then passed into a Multinomial Naive Bayes classifier, which makes predictions by calculating the probability of a given word falling into a certain category.
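Putting the two components together, the production pipeline looks roughly like this; the toy training rows are assumptions, with the real inputs coming from the megatext_clean column:

```python
# Production pipeline sketch: a TF-IDF vectorizer capped at the top 70 terms
# feeding a Multinomial Naive Bayes classifier. Data is illustrative only.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

X_train = ["i cant go on anymore", "feeling empty today",
           "need help right now", "so tired of all this"]
y_train = [1, 0, 1, 0]  # 1 = r/SuicideWatch, 0 = r/depression

production = Pipeline([
    ("tvec", TfidfVectorizer(max_features=70)),  # keep only the top 70 terms
    ("nb", MultinomialNB()),
])
production.fit(X_train, y_train)
print(production.predict(["i cant go on"]))
```

Wrapping both steps in one Pipeline ensures the same vocabulary and weighting learned at fit time are applied to any new post we score.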

Results - The optimised model scored well on our test set, achieving an AUC score of 0.75. We will proceed to understand our model a bit better before making final critiques and recommendations.

3.5 Model evaluation and possible future developments